For this project, we examine the stock prices of four major tech companies: Apple, Facebook, Amazon and Google. The records start in 1980, 2012, 1997 and 2004, respectively, and all run through 2018. The data, downloaded from Kaggle (*https://www.kaggle.com/stexo92/gafa-stock-prices*), contain the date, opening price, closing price, trading volume, daily high and low prices, and adjusted closing price for each company.
Given the nature of the stock market, stock prices may exhibit temporal structure. We take the closing price as the response variable and fit models to help determine what affects each company's closing price and to understand the temporal structure within each series.
We first present some exploratory data analysis for all four companies, then fit simpler models with no temporal structure, and finally fit and evaluate temporal models for each company. For the temporal models, we try two methods: automatically fitted ARIMA models and Gaussian Process models. We then compare the performance of all three types of models.
Lastly, we aim to answer our questions of interest: what factors can potentially affect the closing prices of stocks? Is there temporal dependency in the closing prices? Do the companies differ, or do they share similar trends and structures in their stock prices? We also discuss the adequacy of the models and their potential problems, and offer suggestions for developing this project further.
summary(df)
## Stock Date Open High
## Amazon :1497 Min. :2012-05-09 Min. : 18.08 Min. : 18.27
## Apple :1497 1st Qu.:2013-11-05 1st Qu.: 97.60 1st Qu.: 98.60
## Facebook:1490 Median :2015-05-04 Median : 212.14 Median : 215.35
## Google :1497 Mean :2015-05-02 Mean : 351.09 Mean : 354.08
## 3rd Qu.:2016-10-25 3rd Qu.: 552.10 3rd Qu.: 555.26
## Max. :2018-04-20 Max. :1615.96 Max. :1617.54
## Low Close Adj.Close Volume
## Min. : 17.55 Min. : 17.73 Min. : 17.73 Min. : 7900
## 1st Qu.: 96.58 1st Qu.: 97.50 1st Qu.: 94.30 1st Qu.: 2745600
## Median : 207.75 Median : 212.91 Median : 212.91 Median : 10815000
## Mean : 347.75 Mean : 351.06 Mean : 349.02 Mean : 26572272
## 3rd Qu.: 545.33 3rd Qu.: 551.97 3rd Qu.: 551.97 3rd Qu.: 37140600
## Max. :1590.89 Max. :1598.39 Max. :1598.39 Max. :573576400
## diff
## Min. :-79.18005
## 1st Qu.: -1.34998
## Median : 0.01001
## Mean : -0.02581
## 3rd Qu.: 1.46000
## Max. : 81.38000
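The summary above includes a `diff` column whose construction is not shown in this section. As a minimal sketch, here are two plausible ways such a column could be derived; both are assumptions, and the toy values are illustrative rather than taken from the Kaggle file. The summary reports no missing values for `diff`, which is at least consistent with the first variant.

```r
# Toy data; the values are made up for illustration.
toy <- data.frame(
  Stock = c("A", "A", "A", "B", "B", "B"),
  Date  = as.Date(rep(c("2018-01-02", "2018-01-03", "2018-01-04"), 2)),
  Open  = c(10, 11, 12, 100, 98, 101),
  Close = c(11, 12, 11, 99, 101, 100)
)

# Variant 1 (assumed): intraday change, close minus open.
toy$diff_intraday <- toy$Close - toy$Open

# Variant 2 (assumed): change from the previous close within each stock.
# ave() applies the function per group; the first day of each stock
# has no predecessor and therefore gets NA.
toy <- toy[order(toy$Stock, toy$Date), ]
toy$diff_daily <- ave(toy$Close, toy$Stock,
                      FUN = function(x) c(NA, diff(x)))

print(toy)
```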
ggplot(data = df, aes(x = Stock, y = Close)) +
geom_boxplot() +
ggtitle("Close Price Distribution by Stock")
p1 = ggplot(data = df %>% filter(Stock == "Amazon") %>% arrange(Date), aes(x = Date, y = Close)) +
geom_line() +
ggtitle("Stock Close Price of Amazon")
p2 = ggplot(data = df %>% filter(Stock == "Apple") %>% arrange(Date), aes(x = Date, y = Close)) +
geom_line() +
ggtitle("Stock Close Price of Apple")
p3 = ggplot(data = df %>% filter(Stock == "Google") %>% arrange(Date), aes(x = Date, y = Close)) +
geom_line() +
ggtitle("Stock Close Price of Google")
p4 = ggplot(data = df %>% filter(Stock == "Facebook") %>% arrange(Date), aes(x = Date, y = Close)) +
geom_line() +
ggtitle("Stock Close Price of Facebook")
grid.arrange(p1, p2, p3, p4, nrow = 2, ncol = 2)
p1 = ggplot(data = df %>% filter(Stock == "Amazon") %>% arrange(Date), aes(x = Date, y = Open)) +
geom_line() +
ggtitle("Stock Open Price of Amazon")
p2 = ggplot(data = df %>% filter(Stock == "Apple") %>% arrange(Date), aes(x = Date, y = Open)) +
geom_line() +
ggtitle("Stock Open Price of Apple")
p3 = ggplot(data = df %>% filter(Stock == "Google") %>% arrange(Date), aes(x = Date, y = Open)) +
geom_line() +
ggtitle("Stock Open Price of Google")
p4 = ggplot(data = df %>% filter(Stock == "Facebook") %>% arrange(Date), aes(x = Date, y = Open)) +
geom_line() +
ggtitle("Stock Open Price of Facebook")
grid.arrange(p1, p2, p3, p4, nrow = 2, ncol = 2)
p1 = ggplot(data = df %>% filter(Stock == "Amazon") %>% arrange(Date), aes(x = Date, y = diff)) +
geom_line() +
ggtitle("Stock Price Daily Change of Amazon")
p2 = ggplot(data = df %>% filter(Stock == "Apple") %>% arrange(Date), aes(x = Date, y = diff)) +
geom_line() +
ggtitle("Stock Price Daily Change of Apple")
p3 = ggplot(data = df %>% filter(Stock == "Google") %>% arrange(Date), aes(x = Date, y = diff)) +
geom_line() +
ggtitle("Stock Price Daily Change of Google")
p4 = ggplot(data = df %>% filter(Stock == "Facebook") %>% arrange(Date), aes(x = Date, y = diff)) +
geom_line() +
ggtitle("Stock Price Daily Change of Facebook")
grid.arrange(p1, p2, p3, p4, nrow = 2, ncol = 2)
ggplot(data = cor(df %>% select(Close, Open, Volume, diff, High, Low)) %>% reshape2::melt(), aes(x=Var1, y=Var2, fill=value)) +
geom_tile() +
ggtitle("Correlation matrix")
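The heatmap relies on reshaping the square correlation matrix into long format, with one row per pair of variables. The sketch below shows that step in isolation on simulated columns; it uses base R's `as.data.frame(as.table())` as a dependency-free stand-in for `reshape2::melt()`, and the toy data are an assumption made only for illustration.

```r
set.seed(1)
# Simulated columns standing in for the stock variables
toy <- data.frame(Open = rnorm(50), Volume = rnorm(50))
toy$Close <- toy$Open + rnorm(50, sd = 0.1)

# Square correlation matrix (3 x 3)
cor_mat <- cor(toy[, c("Close", "Open", "Volume")])

# Long format: one row per (Var1, Var2) pair, with the correlation in `value`
cor_long <- as.data.frame(as.table(cor_mat), responseName = "value")

print(cor_long)
```

The resulting columns `Var1`, `Var2` and `value` match the aesthetics used in the `geom_tile()` call above.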
naive.lm = lm(data = df, formula = Close ~ as.factor(Stock) + Date)
naive.lm %>% summary()
##
## Call:
## lm(formula = Close ~ as.factor(Stock) + Date, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -253.24 -105.74 -17.71 78.96 787.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.325e+03 4.643e+01 -71.62 <2e-16 ***
## as.factor(Stock)Apple -4.571e+02 4.951e+00 -92.33 <2e-16 ***
## as.factor(Stock)Facebook -4.759e+02 4.956e+00 -96.02 <2e-16 ***
## as.factor(Stock)Google 7.450e+01 4.951e+00 15.05 <2e-16 ***
## Date 2.350e-01 2.796e-03 84.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 135.4 on 5976 degrees of freedom
## Multiple R-squared: 0.8238, Adjusted R-squared: 0.8237
## F-statistic: 6984 on 4 and 5976 DF, p-value: < 2.2e-16
plot(shuffle(naive.lm$residuals), main = "Naive Residual Plot", ylab = "Residual")
df = df %>% mutate(naive.pred = naive.lm$fitted.values)
rmse(df$Close, df$naive.pred)
## [1] 135.3863
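The `rmse()` function is called without showing where it comes from; it may be provided by a loaded package, which is an assumption on our part. A minimal base R equivalent (the name `rmse_sketch` is hypothetical) is sketched below on simulated data. It also illustrates why the in-sample RMSE of a linear model is slightly smaller than the residual standard error printed by `summary()`: the former divides the residual sum of squares by n, the latter by the residual degrees of freedom.

```r
# Hypothetical helper: root mean squared error of predictions
rmse_sketch <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

set.seed(42)
x <- 1:100
y <- 3 + 0.5 * x + rnorm(100, sd = 2)
fit <- lm(y ~ x)

in_sample_rmse <- rmse_sketch(y, fitted(fit))  # sqrt(RSS / n)
rse <- summary(fit)$sigma                      # sqrt(RSS / residual df)

c(rmse = in_sample_rmse, residual_se = rse)
```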
\[ \begin{aligned} \text{Close}_t &= \text{ARIMA}_{(p, d, q) \times (P, D, Q)_s}(\text{Close}_{t-1}, \text{Close}_{t-2}, \cdots) + \beta_s \end{aligned} \]
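For reference, the non-seasonal part of such a model can be written in backshift notation. This is the standard textbook form rather than anything specific to this report; here \(y_t\) stands for \(\text{Close}_t\), \(B\) is the backshift operator and \(\varepsilon_t\) is white noise, all of which are symbols introduced only for this illustration.

\[ \begin{aligned} \left(1 - \phi_1 B - \cdots - \phi_p B^p\right)(1 - B)^d \, y_t &= \left(1 + \theta_1 B + \cdots + \theta_q B^q\right)\varepsilon_t, \qquad B \, y_t = y_{t-1} \end{aligned} \]

In this form, the autoregressive order sets the number of \(\phi\) terms, the differencing order sets the power of \((1 - B)\), and the moving-average order sets the number of \(\theta\) terms.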
amazon.stock = df %>% filter(Stock == "Amazon") %>% arrange(Date)
ggtsdisplay(amazon.stock %>% select(Close), main = "Close Price of Amazon")
ts.model = auto.arima(amazon.stock %>% select(Close), seasonal = TRUE)
ggtsdisplay(ts.model$residuals, main = "Arima(2,2,0)")
ts.model %>% summary()
## Series: amazon.stock %>% select(Close)
## ARIMA(2,2,0)
##
## Coefficients:
## ar1 ar2
## -0.6589 -0.3277
## s.e. 0.0246 0.0246
##
## sigma^2 estimated as 172.1: log likelihood=-5968.77
## AIC=11943.54 AICc=11943.55 BIC=11959.47
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.005769807 13.10128 8.084945 -0.002588783 1.493505 1.193601
## ACF1
## Training set -0.08137831
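To make the fitted ARIMA(2,2,0) concrete, the sketch below computes a one-step-ahead prediction by hand from the two AR coefficients reported above. The price vector is made up for illustration and the function name `predict_arima220` is hypothetical; the sketch also ignores how the fitting routine initializes the first few fitted values.

```r
# AR coefficients copied from the model summary above
phi <- c(-0.6589, -0.3277)

# Hypothetical helper: one-step-ahead prediction for an ARIMA(2,2,0)
# model without a constant. `y` holds the observed series up to the
# previous day (at least 4 values are needed).
predict_arima220 <- function(y, phi) {
  w <- diff(y, differences = 2)                # w_t = y_t - 2*y_{t-1} + y_{t-2}
  n <- length(w)
  w_next <- phi[1] * w[n] + phi[2] * w[n - 1]  # AR(2) on the twice-differenced series
  m <- length(y)
  2 * y[m] - y[m - 1] + w_next                 # undo the double differencing
}

toy_close <- c(1500, 1510, 1505, 1520, 1518)   # illustrative values only
next_close <- predict_arima220(toy_close, phi)
print(next_close)
```

Both coefficients are negative, so a large jump in the second difference on one day tends to be partly reversed in the next prediction.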
amazon.stock = amazon.stock %>% mutate(ts.residuals = ts.model$residuals)
res.lm.model = lm(data = amazon.stock, formula = ts.residuals ~ 1)
res.lm.model %>% summary()
##
## Call:
## lm(formula = ts.residuals ~ 1, data = amazon.stock)
##
## Residuals:
## Min 1Q Median 3Q Max
## -77.134 -5.187 -0.392 5.086 126.534
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.00577 0.33873 -0.017 0.986
##
## Residual standard error: 13.11 on 1496 degrees of freedom
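The intercept-only regression on the ARIMA residuals estimates a single constant, the residual mean, so adding its fitted values to the ARIMA fit shifts every prediction by that same constant. A small sketch with simulated residuals (a stand-in for `ts.model$residuals`, not the report's data) shows the equivalence:

```r
set.seed(7)
res <- rnorm(200, mean = 0.05, sd = 1.5)  # simulated stand-in for ARIMA residuals

fit0 <- lm(res ~ 1)
beta_hat <- unname(coef(fit0))            # the single estimated coefficient

# The intercept equals the sample mean of the residuals
c(intercept = beta_hat, residual_mean = mean(res))

# Shifting predictions by this constant can only lower the RMSE marginally
rmse_before <- sqrt(mean(res^2))               # errors of the unshifted fit
rmse_after  <- sqrt(mean((res - beta_hat)^2))  # errors after adding the constant
c(before = rmse_before, after = rmse_after)
```

In the Amazon fit above the estimated intercept is -0.00577 with a p-value of 0.986, so the shift is negligible there.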
amazon.stock = amazon.stock %>%
mutate(ts.pred = ts.model$fitted + res.lm.model$fitted.values)
amazon.stock$residuals = c(amazon.stock$Close - amazon.stock$ts.pred)
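# Map color to a series label; a literal color name inside aes() is treated as a category, not a color
ggplot(data = amazon.stock, aes(x = Date)) +
geom_line(aes(y = Close, color = "Actual")) +
geom_line(aes(y = ts.pred, color = "Fitted")) +
scale_color_manual(name = NULL, values = c(Actual = "red", Fitted = "blue")) +
xlab("time") +
ylab("Close Price of Amazon Stock") +
ggtitle("ARIMA(2,2,0)")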
rmse(amazon.stock$Close, amazon.stock$naive.pred)
## [1] 203.9826
rmse(amazon.stock$Close, amazon.stock$ts.pred)
## [1] 13.10128
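The two RMSE values above are not like-for-like: ARIMA fitted values are one-step-ahead predictions that use the previous days' prices, while the linear model only sees the stock and the date. A fairer yardstick for the ARIMA fit is a random-walk baseline that predicts each close with the previous close. The sketch below is an added illustration on a simulated series, not part of the original analysis, and it uses `stats::arima` with a fixed order rather than `auto.arima`.

```r
set.seed(123)
# Simulated random walk with drift, standing in for a closing-price series
close <- 100 + cumsum(rnorm(500, mean = 0.1, sd = 2))

# Random-walk baseline: predict today's close with yesterday's close
rw_pred <- c(NA, head(close, -1))
rw_rmse <- sqrt(mean((close - rw_pred)^2, na.rm = TRUE))

# ARIMA(1,1,0) fitted with base R; in-sample errors are the model residuals
fit <- arima(close, order = c(1, 1, 0))
arima_rmse <- sqrt(mean(residuals(fit)^2))

c(random_walk = rw_rmse, arima = arima_rmse)
```

The MASE column in the `summary()` output plays a similar role, since it scales the MAE by the in-sample MAE of a naive previous-value forecast. The reported value of about 1.19 suggests that the Amazon fit is not clearly better than that baseline.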
forecast(ts.model, 30) %>% plot(xlim=c(1250, 1520))
plot(amazon.stock$residuals, ylab = "Residuals", main = "Residual Plot of ARIMA")
apple.stock = df %>% filter(Stock == "Apple") %>% arrange(Date)
ggtsdisplay(apple.stock %>% select(Close), main = "Close Price of Apple")
ts.model = auto.arima(apple.stock %>% select(Close), seasonal = TRUE)
ggtsdisplay(ts.model$residuals, main = "Arima(2,1,2)")
ts.model %>% summary()
## Series: apple.stock %>% select(Close)
## ARIMA(2,1,2)
##
## Coefficients:
## ar1 ar2 ma1 ma2
## -0.6752 -0.9108 0.6999 0.8846
## s.e. 0.0493 0.0537 0.0567 0.0575
##
## sigma^2 estimated as 2.656: log likelihood=-2851.39
## AIC=5712.79 AICc=5712.83 BIC=5739.34
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.05674586 1.626949 1.149444 0.0358611 1.099604 1.000475
## ACF1
## Training set 0.01167058
apple.stock = apple.stock %>% mutate(ts.residuals = ts.model$residuals)
res.lm.model = lm(data = apple.stock, formula = ts.residuals ~ 1)
res.lm.model %>% summary()
##
## Call:
## lm(formula = ts.residuals ~ 1, data = apple.stock)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7589 -0.7590 0.0060 0.8808 7.5666
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.05675 0.04204 1.35 0.177
##
## Residual standard error: 1.627 on 1496 degrees of freedom
apple.stock = apple.stock %>%
mutate(ts.pred = ts.model$fitted + res.lm.model$fitted.values)
apple.stock$residuals = c(apple.stock$Close - apple.stock$ts.pred)
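# Map color to a series label; a literal color name inside aes() is treated as a category, not a color
ggplot(data = apple.stock, aes(x = Date)) +
geom_line(aes(y = Close, color = "Actual")) +
geom_line(aes(y = ts.pred, color = "Fitted")) +
scale_color_manual(name = NULL, values = c(Actual = "red", Fitted = "blue")) +
xlab("time") +
ylab("Close Price of Apple Stock") +
ggtitle("ARIMA(2,1,2)")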
rmse(apple.stock$Close, apple.stock$naive.pred)
## [1] 121.063
rmse(apple.stock$Close, apple.stock$ts.pred)
## [1] 1.625959
forecast(ts.model, 30) %>% plot(xlim=c(1250, 1520))
plot(apple.stock$residuals, ylab = "Residuals", main = "Residual Plot of ARIMA")
google.stock = df %>% filter(Stock == "Google") %>% arrange(Date)
ggtsdisplay(google.stock %>% select(Close), main = "Close Price of Google")
ts.model = auto.arima(google.stock %>% select(Close), seasonal = TRUE)
ggtsdisplay(ts.model$residuals, main = "Arima(1,1,1)")
ts.model %>% summary()
## Series: google.stock %>% select(Close)
## ARIMA(1,1,1) with drift
##
## Coefficients:
## ar1 ma1 drift
## -0.6977 0.7459 0.5135
## s.e. 0.1781 0.1660 0.2491
##
## sigma^2 estimated as 87.94: log likelihood=-5469.76
## AIC=10947.52 AICc=10947.55 BIC=10968.76
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.0009613606 9.365017 6.13327 -0.01538857 0.9656908 1.00196
## ACF1
## Training set -0.004049327
google.stock = google.stock %>% mutate(ts.residuals = ts.model$residuals)
res.lm.model = lm(data = google.stock, formula = ts.residuals ~ 1)
res.lm.model %>% summary()
##
## Call:
## lm(formula = ts.residuals ~ 1, data = google.stock)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.552 -3.926 -0.184 4.216 91.539
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0009614 0.2421267 0.004 0.997
##
## Residual standard error: 9.368 on 1496 degrees of freedom
google.stock = google.stock %>%
mutate(ts.pred = ts.model$fitted + res.lm.model$fitted.values)
google.stock$residuals = c(google.stock$Close - google.stock$ts.pred)
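# Map color to a series label; a literal color name inside aes() is treated as a category, not a color
ggplot(data = google.stock, aes(x = Date)) +
geom_line(aes(y = Close, color = "Actual")) +
geom_line(aes(y = ts.pred, color = "Fitted")) +
scale_color_manual(name = NULL, values = c(Actual = "red", Fitted = "blue")) +
xlab("time") +
ylab("Close Price of Google Stock") +
ggtitle("ARIMA(1,1,1)")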
rmse(google.stock$Close, google.stock$naive.pred)
## [1] 84.95307
rmse(google.stock$Close, google.stock$ts.pred)
## [1] 9.365017
forecast(ts.model, 30) %>% plot(xlim=c(1250, 1520))
plot(google.stock$residuals, ylab = "Residuals", main = "Residual Plot of ARIMA")
facebook.stock = df %>% filter(Stock == "Facebook") %>% arrange(Date)
ggtsdisplay(facebook.stock %>% select(Close), main = "Close Price of Facebook")
ts.model = auto.arima(facebook.stock %>% select(Close), seasonal = TRUE)
ggtsdisplay(ts.model$residuals, main = "Arima(2,1,2)")
ts.model %>% summary()
## Series: facebook.stock %>% select(Close)
## ARIMA(2,1,2) with drift
##
## Coefficients:
## ar1 ar2 ma1 ma2 drift
## -0.0468 0.8310 0.0467 -0.8966 0.0882
## s.e. 0.0658 0.0661 0.0545 0.0550 0.0304
##
## sigma^2 estimated as 2.819: log likelihood=-2882.04
## AIC=5776.08 AICc=5776.14 BIC=5807.92
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.00257673 1.675672 1.12037 -0.106676 1.516613 0.996834
## ACF1
## Training set 0.00453638
facebook.stock = facebook.stock %>% mutate(ts.residuals = ts.model$residuals)
res.lm.model = lm(data = facebook.stock, formula = ts.residuals ~ 1)
res.lm.model %>% summary()
##
## Call:
## lm(formula = ts.residuals ~ 1, data = facebook.stock)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.4620 -0.7034 -0.0093 0.8094 14.3003
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.002577 0.043425 -0.059 0.953
##
## Residual standard error: 1.676 on 1489 degrees of freedom
facebook.stock = facebook.stock %>%
mutate(ts.pred = ts.model$fitted + res.lm.model$fitted.values)
facebook.stock$residuals = c(facebook.stock$Close - facebook.stock$ts.pred)
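# Map color to a series label; a literal color name inside aes() is treated as a category, not a color
ggplot(data = facebook.stock, aes(x = Date)) +
geom_line(aes(y = Close, color = "Actual")) +
geom_line(aes(y = ts.pred, color = "Fitted")) +
scale_color_manual(name = NULL, values = c(Actual = "red", Fitted = "blue")) +
xlab("time") +
ylab("Close Price of Facebook Stock") +
ggtitle("ARIMA(2,1,2)")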
rmse(facebook.stock$Close, facebook.stock$naive.pred)
## [1] 98.9735
rmse(facebook.stock$Close, facebook.stock$ts.pred)
## [1] 1.67567
forecast(ts.model, 30) %>% plot(xlim=c(1250, 1520))
plot(facebook.stock$residuals, ylab = "Residuals", main = "Residual Plot of ARIMA")
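The four per-company blocks repeat the same steps, which invites copy-paste slips in labels and titles. As a sketch of how they could be folded into one function: the name `fit_stock` and its arguments are hypothetical, `stats::arima` with a user-supplied order replaces `auto.arima` to keep the example dependency-free, and the intercept-only residual regression is left out for brevity. The data are simulated, not the Kaggle data.

```r
# Hypothetical helper: fit an ARIMA model to one stock's closing prices
# and return the augmented data, the model, and the in-sample RMSE.
fit_stock <- function(df, stock, arima_order = c(1, 1, 0)) {
  one <- df[df$Stock == stock, ]
  one <- one[order(one$Date), ]                  # chronological order
  fit <- arima(one$Close, order = arima_order)
  one$ts.residuals <- as.numeric(residuals(fit))
  one$ts.pred <- one$Close - one$ts.residuals    # fitted = observed - residual
  list(data = one, model = fit,
       rmse = sqrt(mean(one$ts.residuals^2)))
}

# Toy data with two simulated stocks
set.seed(2018)
n <- 200
toy <- data.frame(
  Stock = rep(c("A", "B"), each = n),
  Date  = rep(seq(as.Date("2017-01-01"), by = "day", length.out = n), 2),
  Close = c(50 + cumsum(rnorm(n)), 200 + cumsum(rnorm(n, sd = 3)))
)

# One call per stock instead of one copied block per stock
results <- lapply(c(A = "A", B = "B"), function(s) fit_stock(toy, s))
sapply(results, function(r) r$rmse)
```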